Chapter 9 Constructions and Idioms

9.1 Collostruction

In this chapter, I would like to talk about the relationship between constructions and words. Words may co-occur to form collocation patterns; when words co-occur with a particular morphosyntactic pattern, they form collostruction patterns.

Here I would like to introduce a widely applied method for research on the meanings of constructional schemas: Collostructional Analysis (Stefanowitsch and Gries 2003). This is the major framework in corpus linguistics for the study of the relationship between words and constructions.

The idea behind collostructional analysis is simple: the meaning of a morphosyntactic construction can often be determined by its co-occurring words.

In particular, words that are strongly associated with the construction (i.e., that co-occur with it more often than chance would predict) are referred to as collexemes of the construction.

Collostructional Analysis is an umbrella term that covers several sub-analyses for constructional semantics:

  • collexeme analysis
  • co-varying collexeme analysis
  • distinctive collexeme analysis

This chapter will focus on the first one, collexeme analysis, whose principles can be extended to the other analyses.
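Before turning to the tools, it may help to see the statistic that underlies collexeme analysis. A word's association with a construction is assessed from a 2x2 contingency table, conventionally with the Fisher exact test. The sketch below uses made-up frequencies purely for illustration:

```r
# Collexeme analysis cross-tabulates a word W against a construction C.
# All four counts below are invented for illustration only:
# corpus of 10,000 words, 50 construction tokens, W occurs 200 times,
# 20 of which are inside the construction.
w_in_cxn  <- 20      # W inside C
w_out_cxn <- 180     # W elsewhere in the corpus
o_in_cxn  <- 30      # other words inside C
o_out_cxn <- 9770    # other words elsewhere

m <- matrix(c(w_in_cxn, w_out_cxn,
              o_in_cxn, o_out_cxn),
            nrow = 2, byrow = TRUE)
ft <- fisher.test(m)

# Collostruction strength is conventionally the negative log10 p-value
coll_strength <- -log10(ft$p.value)
coll_strength
```

The larger `coll_strength` is, the more strongly the word is attracted to the construction.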

Also, I will demonstrate how we can conduct a collexeme analysis using the R script for Collostructional Analysis written by Stefan Gries.

9.2 Corpus

I will use the Apple News Corpus from Chapter 8 as our corpus.

In this demonstration, I would like to look at a particular morphosyntactic frame in Chinese, X + 起來. Our goal is simple: to determine the semantics of this constructional schema, it would be very informative to find out which words are strongly attracted to its X slot.

So our first step is to load the corpus into R.
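A minimal sketch of the loading step is shown below. The actual file location of the Apple News Corpus is whatever was used in Chapter 8; here a tiny stand-in file is created so the sketch is self-contained:

```r
# Stand-in for the real corpus file (see Chapter 8 for the actual path)
demo_path <- file.path(tempdir(), "apple_news_demo.txt")
writeLines(c("這部手機用起來很方便。",
             "市場看起來相當穩定。"), demo_path)

# Load the raw texts line by line
apple_news <- readLines(demo_path, encoding = "UTF-8")
length(apple_news)   # number of lines loaded
```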

9.3 Word Segmentation

Because the Apple News Corpus is a raw-text corpus, we first need to word-segment it.
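One common choice for Chinese word segmentation in R is the jiebaR package; the snippet below is a sketch of that approach (the chapter's own pipeline may use a different segmenter):

```r
library(jiebaR)

# Initialize a segmenter with jiebaR's default dictionary
seg <- worker()

# segment() returns a character vector of words
words <- segment("這部手機用起來很方便", seg)
words
```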

9.4 Extract Constructions

With the word-segmented texts, we can now extract our target patterns from the corpus using regular expressions.
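A sketch of the extraction step in base R, assuming (for illustration) that the X slot is a single Han character; the PCRE class `\p{Han}` matches Chinese characters when `perl = TRUE`:

```r
# Toy sentences standing in for the segmented corpus
sentences <- c("這部手機用起來很方便",
               "市場看起來相當穩定",
               "他站起來說話")

# Extract every "X + 起來" token, where X is one Han character
matches <- regmatches(sentences,
                      gregexpr("\\p{Han}起來", sentences, perl = TRUE))
qilai_tokens <- unlist(matches)
qilai_tokens
```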

9.5 Distributional Information Needed for CA

To perform the collostructional analysis, which is essentially a statistical analysis of the association between words and constructions, we need to collect the necessary distributional information.

Also, to use Stefan Gries' R script for Collostructional Analysis, we need the following information:

  1. Joint Frequencies of Words and Constructions
  2. Frequencies of Words in Corpus
  3. Corpus Size (total number of words in corpus)
  4. Construction Size (total number of constructions in corpus)
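The joint frequencies (item 1) can be obtained directly from the extracted construction tokens. A sketch with toy tokens (in practice these come from the regex extraction step above):

```r
# Toy construction tokens of the form "X + 起來"
qilai_tokens <- c("看起來", "看起來", "聽起來", "用起來", "看起來")

# Strip the fixed 起來 part to recover the X words,
# then tabulate their joint frequencies with the construction
x_words    <- substr(qilai_tokens, 1, 1)
joint_freq <- sort(table(x_words), decreasing = TRUE)
joint_freq
```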

9.5.1 Word Frequency List
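The corpus-wide word frequencies (item 2) can be computed from the full vector of segmented words. A self-contained sketch with a toy word vector:

```r
# Stand-in for the full vector of segmented corpus words
words <- c("看", "市場", "看", "手機", "用", "看")

# Frequency list, sorted from most to least frequent
word_freq <- sort(table(words), decreasing = TRUE)
word_freq
```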

9.5.3 Other Information

We prepare necessary distributional information for the later collostructional analysis.

## Corpus Size:  3209617
## Construction Size:  546
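The two sizes above are simply the lengths of the relevant vectors; here is a sketch with stand-in data:

```r
# Stand-ins: `words` is the full vector of segmented corpus words,
# `qilai_tokens` the extracted construction tokens
words        <- c("看", "市場", "手機", "用", "看")
qilai_tokens <- c("看起來", "用起來")

cat("Corpus Size: ", length(words), "\n")
cat("Construction Size: ", length(qilai_tokens), "\n")
```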

9.5.4 Create Output File

This is to create an empty output .txt file to store the results from the Collostructional Analysis script.
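A minimal sketch of this step, using the file name we will later give to the script; `file.create()` returns `TRUE` on success:

```r
# Create an empty file to hold the script's results
file.create("qilai_results.txt")
```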

## [1] TRUE

9.5.5 Run coll.analysis.r

Finally, we are now ready to perform the collostructional analysis using Stefan Gries' coll.analysis.r.

This is an R script with interactive instructions. When you run the analysis, you will be prompted with a series of guiding questions, which you answer by entering the necessary information.

Specifically, data to be entered include:

  • analysis to perform: 1
  • name of construction: QILAI
  • corpus size: 3209617
  • freq of constructions: 546
  • index of association strength: 1 (=fisher-exact)
  • sorting: 4 (=collostruction strength)
  • decimals: 2
  • text file with the raw data: <qilai.tsv>
  • output file: <qilai_results.txt>
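The raw-data file (qilai.tsv above) can be prepared along the following lines. Note that the column layout shown here is an assumption on my part; consult the script's own prompts and documentation for the exact format it expects:

```r
# ASSUMED layout: one row per collexeme, with its corpus frequency
# and its frequency inside the construction. Check coll.analysis.r's
# prompts for the exact columns and their order. All numbers are
# invented for illustration.
qilai_data <- data.frame(
  WORD      = c("看", "聽", "用"),
  CORP.FREQ = c(52000, 8700, 31000),  # made-up corpus frequencies
  CX.FREQ   = c(3,     2,    1)       # made-up joint frequencies
)
write.table(qilai_data, "qilai.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)

# Then source the script and follow its interactive prompts:
# source("coll.analysis.r")
# coll.analysis()
```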

The output of coll.analysis.r is as shown below:

9.6 Chinese Four-character Idioms

Many studies have shown that Chinese makes use of a large proportion of four-character idioms in discourse. This chapter will provide an exploratory analysis of four-character idioms in Chinese.

9.7 Dictionary Entries

In our demo_data directory, there is a file dict-ch-idiom.txt, which includes a list of four-character idioms in Chinese. These idioms were collected from 搜狗輸入法詞庫 (the Sogou input method lexicon); the original files (.scel format) have been combined, deduplicated, and converted to a more machine-readable format, i.e., .txt.

Let’s first import the idioms in the file.
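A sketch of the import step; the path comes from the description above, and the block falls back to a tiny stand-in sample so it also runs when the file is absent:

```r
dict_path <- "demo_data/dict-ch-idiom.txt"

# Guarded so the sketch is runnable even without the dictionary file
idiom <- if (file.exists(dict_path)) {
  readLines(dict_path, encoding = "UTF-8")
} else {
  c("阿保之功", "阿鼻地獄", "作姦犯科")  # tiny stand-in sample
}
head(idiom)
length(idiom)
```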

## [1] "阿保之功" "阿保之勞" "阿鼻地獄" "阿鼻叫喚" "阿斗太子" "阿芙蓉膏"
## [1] "罪無可逭" "罪人不帑" "作纛旗兒" "坐纛旂兒" "作姦犯科" "作育英才"
## [1] 56536

In order to make use of the tidy structure in R, we convert the data into a tibble:
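The conversion itself is a one-liner with the tibble package; a sketch with a small sample vector:

```r
library(tibble)

# Sample idioms standing in for the full `idiom` vector
idiom <- c("阿保之功", "阿鼻地獄", "想來想去", "說來道去")

# One row per idiom, in a tidy one-column tibble
idiom_tbl <- tibble(idiom = idiom)
idiom_tbl
```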

9.8 Case Study: X來Y去

We can create a regular expression pattern to extract all idioms with the format of X來Y去:
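A sketch of this extraction in base R; since 來 and 去 are fixed in the second and fourth positions, `.` can stand for the X and Y slots:

```r
# Sample idioms standing in for the full `idiom` vector
idiom <- c("想來想去", "直來直去", "說來道去", "朝來暮去", "作姦犯科")

# X來Y去: any character, 來, any character, 去
laiqu <- grep("^.來.去$", idiom, value = TRUE)
laiqu
```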

To analyze the meaning of this constructional schema, we may need to extract the X and Y in the schema:

One empirical question is how many of these idioms are of the pattern X=Y (e.g., 想來想去, 直來直去) and how many are of X!=Y (e.g., 說來道去, 朝來暮去):
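Since the positions of X and Y are fixed, both questions reduce to character indexing; a sketch with the sample idioms from the text:

```r
# Sample X來Y去 idioms (examples taken from the text)
laiqu <- c("想來想去", "直來直去", "說來道去", "朝來暮去")

x <- substr(laiqu, 1, 1)  # character filling the X slot
y <- substr(laiqu, 3, 3)  # character filling the Y slot

# Distribution of the two patterns
table(ifelse(x == y, "X=Y", "X!=Y"))
```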

9.9 Exercises


Exercise 9.1 Please use idiom and extract the idioms with the schema of 一X一Y.

Exercise 9.2 Also with idiom as our data source, suppose we are now interested in all idioms that contain duplicated characters, following schemas like either _A_A or A_A_, where A is a fixed character. How can we extract all idioms of these two types from idiom? Also, provide the distribution of the two types.


Exercise 9.3 Following Exercise 9.2, for each type of the idioms, please provide their respective proportions of X=Y vs. X!=Y.

Exercise 9.4 Following Exercise 9.3, please identify the character that is duplicated in the idioms. One follow-up analysis would be to look at the distribution of these pivotal characters. Can you reproduce a graph as shown below as closely as possible?

References

Stefanowitsch, Anatol, and Stefan Th. Gries. 2003. “Collostructions: Investigating the Interaction of Words and Constructions.” International Journal of Corpus Linguistics 8 (2): 209–43. Amsterdam: John Benjamins.